The goal of this homework is to get more practice with pandas and with clustering on various datasets.
This exercise will be using the Airbnb dataset for NYC called listings.csv. You can download it directly here
a) Produce a heatmap using the Folium package (you can install it using pip) of the mean listing price per location (latitude and longitude) over the NYC map. (5 points)
Hints:
index.html - open it in your browser and you'll see the heatmap
import pandas as pd
import numpy as np
import matplotlib
print("Peng Huang U50250882 phuang@bu.edu")
airbnb = pd.read_csv('listings.csv',dtype={'license': object})
# Reference https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
airbnb.head(10)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [52], in <module>
      1 import pandas as pd
      2 import numpy as np
----> 3 import matplotlib
      4 print("Peng Huang U50250882 phuang@bu.edu")
      5 airbnb = pd.read_csv('listings.csv',dtype={'license': object})

ModuleNotFoundError: No module named 'matplotlib'
# https://pandas.pydata.org/docs/user_guide/groupby.html
# df = pd.DataFrame(
# [
# ("bird", "Falconiformes", 389.0),
# ("bird", "Psittaciformes", 24.0),
# ("mammal", "Carnivora", 80.2),
# ("mammal", "Primates", np.nan),
# ("mammal", "Carnivora", 58),
# ],
# index=["falcon", "parrot", "lion", "monkey", "leopard"],
# columns=("class", "order", "max_speed"),
# )
# df
# grouped=df.groupby('class')
# grouped['max_speed'].mean()
from folium.plugins import HeatMap
import folium
grouped = airbnb.groupby(['latitude','longitude'])
# Select the column before averaging; averaging every column fails on text columns in recent pandas
airbnb_mean_prices=grouped['price'].mean() # pandas.core.series.Series
airbnb_mean_prices
latitude longitude
40.504559 -74.249840 98.0
40.521980 -74.180370 145.0
40.523390 -74.205170 118.0
40.531250 -74.201350 650.0
40.531380 -74.191130 89.0
...
40.910909 -73.894079 70.0
40.911380 -73.896770 120.0
40.911390 -73.903800 37.0
40.911990 -73.849080 1280.0
40.914070 -73.898350 70.0
Name: price, Length: 37165, dtype: float64
'''
References
https://stackoverflow.com/questions/54752175/add-heatmap-to-a-layer-in-folium
https://python-visualization.github.io/folium/plugins.html
'''
import random
coordinates=airbnb_mean_prices.index.tolist()
mean_prices=airbnb_mean_prices.values.tolist()
heat_data=[]
for i in range(len(coordinates)):
    heat_data.append([coordinates[i][0],coordinates[i][1],mean_prices[i]])
# heat_data=[[40.504559,-74.249840,98.0],[40.521980 , -74.180370 ,145.0]]
nyc_map = folium.Map([40.693943, -73.985880] , zoom_start=10)
HeatMap(heat_data).add_to(nyc_map)
nyc_map.save("index.html")
nyc_map
b) Normalize the price by subtracting the mean and dividing by the standard deviation. Then reproduce the heatmap from a). Comment on any differences you observe. - (5 points)
airbnb.loc[:,'price'] # pandas.core.series.Series
mean_price=airbnb.loc[:,'price'].mean()
std_price=airbnb.loc[:,'price'].std()
def normalize(price):
    return (price-mean_price)/std_price
normalized_prices=airbnb.loc[:,'price'].apply(normalize) # pandas.core.series.Series
airbnb.loc[:,'normalized_price']=normalized_prices
airbnb
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | normalized_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.753560 | -73.985590 | Entire home/apt | 150 | 30 | 48 | 2019-11-04 | 0.33 | 3 | 322 | 0 | NaN | -0.052751 |
| 1 | 3831 | Whole flr w/private bdrm, bath & kitchen(pls r... | 4869 | LisaRoxanne | Brooklyn | Bedford-Stuyvesant | 40.684940 | -73.957650 | Entire home/apt | 73 | 1 | 408 | 2021-06-29 | 4.91 | 1 | 220 | 38 | NaN | -0.316119 |
| 2 | 5121 | BlissArtsSpace! | 7356 | Garon | Brooklyn | Bedford-Stuyvesant | 40.685350 | -73.955120 | Private room | 60 | 30 | 50 | 2016-06-05 | 0.53 | 2 | 365 | 0 | NaN | -0.360584 |
| 3 | 5136 | Spacious Brooklyn Duplex, Patio + Garden | 7378 | Rebecca | Brooklyn | Sunset Park | 40.662650 | -73.994540 | Entire home/apt | 275 | 5 | 2 | 2021-08-08 | 0.02 | 1 | 91 | 1 | NaN | 0.374795 |
| 4 | 5178 | Large Furnished Room Near B'way | 8967 | Shunichi | Manhattan | Midtown | 40.764570 | -73.983170 | Private room | 68 | 2 | 505 | 2021-10-20 | 3.70 | 1 | 218 | 31 | NaN | -0.333221 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37708 | 53123149 | Lovely Room in Bedford-Stuyvesant Apartment | 305240193 | June | Brooklyn | Bedford-Stuyvesant | 40.682389 | -73.955540 | Private room | 65 | 30 | 0 | NaN | NaN | 391 | 364 | 0 | NaN | -0.343482 |
| 37709 | 53123691 | Unfurnished Room in West Harlem Apartment | 305240193 | June | Manhattan | Upper West Side | 40.801062 | -73.961581 | Private room | 58 | 30 | 0 | NaN | NaN | 391 | 364 | 0 | NaN | -0.367425 |
| 37710 | 53123840 | MASSIVE 8BR/8BTH Brooklyn Townhouse w/ Backyard | 13603829 | Kay | Brooklyn | Bushwick | 40.682358 | -73.908384 | Entire home/apt | 914 | 1 | 0 | NaN | NaN | 7 | 358 | 0 | NaN | 2.560410 |
| 37711 | 53126354 | Unfurnished Room in West Harlem Apartment | 305240193 | June | Manhattan | Upper West Side | 40.800454 | -73.963746 | Private room | 66 | 30 | 0 | NaN | NaN | 391 | 364 | 0 | NaN | -0.340062 |
| 37712 | 53127631 | Bright Room in West Harlem Apartment | 305240193 | June | Manhattan | Upper West Side | 40.799822 | -73.966022 | Private room | 65 | 30 | 0 | NaN | NaN | 391 | 364 | 0 | NaN | -0.343482 |
37713 rows × 19 columns
grouped = airbnb.groupby(['latitude','longitude'])
# Select the column before averaging; averaging every column fails on text columns in recent pandas
airbnb_mean_normalized_prices=grouped['normalized_price'].mean() # pandas.core.series.Series
coordinates=airbnb_mean_normalized_prices.index.tolist()
normalized_mean_prices=airbnb_mean_normalized_prices.values.tolist()
normalized_heat_data=[]
for i in range(len(coordinates)):
    normalized_heat_data.append([coordinates[i][0],coordinates[i][1],normalized_mean_prices[i]])
# heat_data=[[40.504559,-74.249840,98.0],[40.521980 , -74.180370 ,145.0]]
nyc_map = folium.Map([40.693943, -73.985880] , zoom_start=10)
HeatMap(normalized_heat_data).add_to(nyc_map)
nyc_map.save("index_normalized.html")
nyc_map
-> your answer here
After normalization, some low-price points (such as those near Newark) show up clearly in the heatmap, whereas they are hard to see in the un-normalized map from 1(a).
Below is the normalized heatmap from 1(b)
Below is the un-normalized heatmap from 1(a)
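To make the normalization in b) concrete, here is a minimal sketch of the same formula applied to the first five prices shown in the table above. It uses only the standard library; `statistics.stdev` is the sample standard deviation, which matches the default of pandas `Series.std()`. The helper name `z_scores` is an illustrative assumption, not part of the assignment.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for each x, using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), like pandas Series.std()
    return [(x - mean) / std for x in values]

prices = [150, 73, 60, 275, 68]  # first five prices from the listing table
normalized = z_scores(prices)

# Below-average prices become negative, above-average prices positive
for price, z in zip(prices, normalized):
    print(price, round(z, 3))
```

The values differ from the `normalized_price` column above because the mean and standard deviation here come from five rows only. Since the weights now include negative numbers, it is worth checking how the heatmap layer treats them before reading too much into the colors.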
c) Normalize the original price using sklearn's MinMaxScaler to the interval [0,1]. Then reproduce the Heatmap from a). Comment on any differences you observe. - (5 points)
# Reference https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
#
from sklearn.preprocessing import MinMaxScaler
airbnb_1c = pd.read_csv('listings.csv',dtype={'license': object})
#data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler() # sklearn.preprocessing._data.MinMaxScaler
airbnb_series_of_prices=airbnb_1c.loc[:,'price']
print(airbnb_series_of_prices)
airbnb_df_of_prices=airbnb_series_of_prices.to_frame()
# https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html
print(airbnb_df_of_prices)
scaler.fit(airbnb_df_of_prices) # similar to training a model
prices_scaled=scaler.transform(airbnb_df_of_prices) # similar to using a model to predict
print(prices_scaled)
airbnb.loc[:,"scaled_price"]=prices_scaled
print(airbnb)
grouped_1c = airbnb.groupby(['latitude','longitude'])
# Select the column before averaging; averaging every column fails on text columns in recent pandas
series_of_mean_prices=grouped_1c['scaled_price'].mean() # pandas.core.series.Series
print(series_of_mean_prices)
coordinates_1c=series_of_mean_prices.index.tolist()
mean_prices_1c=series_of_mean_prices.values.tolist()
heat_data_1c=[]
for i in range(len(coordinates_1c)):
    heat_data_1c.append([coordinates_1c[i][0],coordinates_1c[i][1],mean_prices_1c[i]])
#temp_heat_data=[[40.504559,-74.249840,1],[40.521980 , -74.180370 ,0.8]]
nyc_map_1c = folium.Map([40.693943, -73.985880] , zoom_start=10)
#print(heat_data_1c)
HeatMap(heat_data_1c).add_to(nyc_map_1c)
nyc_map_1c.save("index_1c.html")
nyc_map_1c
0 150
1 73
2 60
3 275
4 68
...
37708 65
37709 58
37710 914
37711 66
37712 65
Name: price, Length: 37713, dtype: int64
price
0 150
1 73
2 60
3 275
4 68
... ...
37708 65
37709 58
37710 914
37711 66
37712 65
[37713 rows x 1 columns]
[[0.015 ]
[0.0073]
[0.006 ]
...
[0.0914]
[0.0066]
[0.0065]]
id name host_id \
0 2595 Skylit Midtown Castle 2845
1 3831 Whole flr w/private bdrm, bath & kitchen(pls r... 4869
2 5121 BlissArtsSpace! 7356
3 5136 Spacious Brooklyn Duplex, Patio + Garden 7378
4 5178 Large Furnished Room Near B'way 8967
... ... ... ...
37708 53123149 Lovely Room in Bedford-Stuyvesant Apartment 305240193
37709 53123691 Unfurnished Room in West Harlem Apartment 305240193
37710 53123840 MASSIVE 8BR/8BTH Brooklyn Townhouse w/ Backyard 13603829
37711 53126354 Unfurnished Room in West Harlem Apartment 305240193
37712 53127631 Bright Room in West Harlem Apartment 305240193
host_name neighbourhood_group neighbourhood latitude \
0 Jennifer Manhattan Midtown 40.753560
1 LisaRoxanne Brooklyn Bedford-Stuyvesant 40.684940
2 Garon Brooklyn Bedford-Stuyvesant 40.685350
3 Rebecca Brooklyn Sunset Park 40.662650
4 Shunichi Manhattan Midtown 40.764570
... ... ... ... ...
37708 June Brooklyn Bedford-Stuyvesant 40.682389
37709 June Manhattan Upper West Side 40.801062
37710 Kay Brooklyn Bushwick 40.682358
37711 June Manhattan Upper West Side 40.800454
37712 June Manhattan Upper West Side 40.799822
longitude room_type price minimum_nights number_of_reviews \
0 -73.985590 Entire home/apt 150 30 48
1 -73.957650 Entire home/apt 73 1 408
2 -73.955120 Private room 60 30 50
3 -73.994540 Entire home/apt 275 5 2
4 -73.983170 Private room 68 2 505
... ... ... ... ... ...
37708 -73.955540 Private room 65 30 0
37709 -73.961581 Private room 58 30 0
37710 -73.908384 Entire home/apt 914 1 0
37711 -73.963746 Private room 66 30 0
37712 -73.966022 Private room 65 30 0
last_review reviews_per_month calculated_host_listings_count \
0 2019-11-04 0.33 3
1 2021-06-29 4.91 1
2 2016-06-05 0.53 2
3 2021-08-08 0.02 1
4 2021-10-20 3.70 1
... ... ... ...
37708 NaN NaN 391
37709 NaN NaN 391
37710 NaN NaN 7
37711 NaN NaN 391
37712 NaN NaN 391
availability_365 number_of_reviews_ltm license scaled_price
0 322 0 NaN 0.0150
1 220 38 NaN 0.0073
2 365 0 NaN 0.0060
3 91 1 NaN 0.0275
4 218 31 NaN 0.0068
... ... ... ... ...
37708 364 0 NaN 0.0065
37709 364 0 NaN 0.0058
37710 358 0 NaN 0.0914
37711 364 0 NaN 0.0066
37712 364 0 NaN 0.0065
[37713 rows x 19 columns]
latitude longitude
40.504559 -74.249840 0.0098
40.521980 -74.180370 0.0145
40.523390 -74.205170 0.0118
40.531250 -74.201350 0.0650
40.531380 -74.191130 0.0089
...
40.910909 -73.894079 0.0070
40.911380 -73.896770 0.0120
40.911390 -73.903800 0.0037
40.911990 -73.849080 0.1280
40.914070 -73.898350 0.0070
Name: scaled_price, Length: 37165, dtype: float64
-> your answer here
As shown below, the contours of the two heatmaps differ. The gradation in the scaled heatmap is slightly more apparent than in the un-scaled one.
Below is the scaled heatmap from 1(c)
Below is the un-scaled heatmap from 1(a)
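For reference, with its default range `MinMaxScaler` applies the linear map (x - min) / (max - min) to each column. The sketch below reproduces that formula in plain Python on a hypothetical three-value sample. The printed output above (150 becomes 0.015, 73 becomes 0.0073) is consistent with a minimum price of 0 and a maximum of 10000; that is an inference from the output, not something the notebook states. The name `min_max_scale` is illustrative.

```python
def min_max_scale(values):
    """Linearly map values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical sample: a free listing, a typical one, and an extreme outlier
prices = [0, 150, 10000]
scaled = min_max_scale(prices)
print(scaled)
```

Because the map is linear, the ordering of prices is unchanged; a single very large price pushes most scaled values close to zero.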
d) Plot a bar chart of the average price (un-normalized) per room type. Briefly comment on the relation between price and room type. - (2.5 points)
# Reference:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.bar.html
#
airbnb_1d = pd.read_csv('listings.csv',dtype={'license': object})
grouped_1d = airbnb_1d.groupby('room_type') #pandas.core.groupby.generic.DataFrameGroupBy
# Average only the price column; text columns cannot be averaged
series_of_mean_prices_1d=grouped_1d['price'].mean()
print(series_of_mean_prices_1d)
series_of_mean_prices_1d.plot.bar()
room_type
Entire home/apt    217.040971
Hotel room         312.886179
Private room       102.949608
Shared room        129.656250
Name: price, dtype: float64
<AxesSubplot:xlabel='room_type'>
On average, hotel rooms have the highest prices and private rooms the lowest. Entire homes/apartments and shared rooms fall in between, with entire homes/apartments priced higher than shared rooms.
e) Plot on the NYC map the top 10 most expensive listings - (2.5 points)
https://piazza.com/class/kyj3ikj3q27389?cid=213
We are supposed to choose 10 unique markers, so expand your selection to select 10 points with different lat/lon values (I think I had to expand my selection to 12 or 13 points to get 10 unique locations). That should take care of your first question as well. ~ An instructor (Saurav vara prasad chennuri) endorsed this answer ~
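The instructor's note about unique markers can be sketched without pandas: walk the listings from most to least expensive and skip any location already picked. This is only an illustration on hypothetical rows; the function name `top_k_unique_locations` and the sample data are assumptions, and the notebook code takes a different route (a per-location maximum followed by `nlargest`).

```python
def top_k_unique_locations(listings, k, key="price"):
    """Return up to k listings with the largest `key`, one per (lat, lon) pair."""
    seen = set()
    picked = []
    for row in sorted(listings, key=lambda r: r[key], reverse=True):
        location = (row["latitude"], row["longitude"])
        if location in seen:
            continue  # a marker here would overlap one already picked
        seen.add(location)
        picked.append(row)
        if len(picked) == k:
            break
    return picked

# Hypothetical rows: the two most expensive listings share one location
listings = [
    {"latitude": 40.75, "longitude": -73.98, "price": 900},
    {"latitude": 40.75, "longitude": -73.98, "price": 800},
    {"latitude": 40.68, "longitude": -73.95, "price": 700},
    {"latitude": 40.66, "longitude": -73.99, "price": 600},
]
top = top_k_unique_locations(listings, k=3)
print([row["price"] for row in top])
```

The same helper works for parts f) and g) by passing a different `key`, for example `number_of_reviews` or `availability_365`.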
# Reference: df.groupby(['Mt'], sort=False)['count'].max()
# Reference: https://python-visualization.github.io/folium/quickstart.html
airbnb_1e = pd.read_csv('listings.csv',dtype={'license': object})
grouped_1e=airbnb_1e.groupby(['latitude','longitude'])
series_of_max_prices_1e=grouped_1e['price'].max()
series_of_largest_prices_1e=series_of_max_prices_1e.nlargest(10,keep="all")
nyc_map_1e = folium.Map([40.693943, -73.985880] , zoom_start=10)
coordinates_1e=series_of_largest_prices_1e.index.tolist()
for i in range(len(coordinates_1e)):
    folium.Marker(location=list(coordinates_1e[i])).add_to(nyc_map_1e)
nyc_map_1e.save("index_1e.html")
nyc_map_1e
f) Plot on the NYC map the top 10 most reviewed listings - (2.5 points)
airbnb_1f = pd.read_csv('listings.csv',dtype={'license': object})
grouped_1f=airbnb_1f.groupby(['latitude','longitude'])
series_of_max_reviews_1f=grouped_1f['number_of_reviews'].max()
series_of_largest_reviews_1f=series_of_max_reviews_1f.nlargest(10,keep="all")
nyc_map_1f = folium.Map([40.693943, -73.985880] , zoom_start=10)
coordinates_1f=series_of_largest_reviews_1f.index.tolist()
for i in range(len(coordinates_1f)):
    folium.Marker(location=list(coordinates_1f[i])).add_to(nyc_map_1f)
nyc_map_1f.save("index_1f.html")
nyc_map_1f
g) Plot on the NYC map the top 10 most available listings - (2.5 points)
airbnb_1g = pd.read_csv('listings.csv',dtype={'license': object})
grouped_1g=airbnb_1g.groupby(['latitude','longitude'])
series_of_max_availability_1g=grouped_1g['availability_365'].max()
series_of_largest_availability_1g=series_of_max_availability_1g.nlargest(10,keep="first")
nyc_map_1g = folium.Map([40.693943, -73.985880] , zoom_start=10)
coordinates_1g=series_of_largest_availability_1g.index.tolist()
for i in range(len(coordinates_1g)):
    folium.Marker(location=list(coordinates_1g[i])).add_to(nyc_map_1g)
nyc_map_1g.save("index_1g.html")
nyc_map_1g
h) Using longitude, latitude, price, and number_of_reviews, use Kmeans to create 5 clusters. Plot the points on the NYC map in a color corresponding to their cluster. - (5 points)
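One possible shape for h), as a sketch under stated assumptions rather than a reference solution: fit `KMeans` on the four columns, then draw each listing as a small circle colored by its cluster label. The data below is synthetic, and the palette, the helper names, and the output file name `index_1h.html` are all hypothetical. `folium` is imported inside the drawing function so the clustering part runs without it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Five arbitrary color names, one per cluster (an assumption, any palette works)
CLUSTER_COLORS = ["red", "blue", "green", "purple", "orange"]

def cluster_listings(features, n_clusters=5, seed=0):
    """Fit KMeans on rows of [longitude, latitude, price, number_of_reviews]."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features)

def draw_clusters(features, labels, out_path="index_1h.html"):
    """Draw one colored circle per listing; folium is imported lazily."""
    import folium
    cluster_map = folium.Map([40.693943, -73.985880], zoom_start=10)
    for (lon, lat, _price, _reviews), label in zip(features, labels):
        folium.CircleMarker(
            location=[float(lat), float(lon)], radius=2,
            color=CLUSTER_COLORS[int(label)], fill=True,
        ).add_to(cluster_map)
    cluster_map.save(out_path)
    return cluster_map

# Synthetic stand-in for the four columns of listings.csv
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.uniform(-74.25, -73.70, 200),   # longitude
    rng.uniform(40.50, 40.91, 200),     # latitude
    rng.uniform(30, 1000, 200),         # price
    rng.integers(0, 500, 200),          # number_of_reviews
])
labels = cluster_listings(features)
print(np.bincount(labels))
```

With the real data, `features` would come from something like `airbnb[['longitude','latitude','price','number_of_reviews']].to_numpy()`, and `draw_clusters(features, labels)` would write the map.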
i) You should see points in the same cluster all over the map - briefly explain why that is. - (2.5 points)
-> your answer here
j) How many clusters would you recommend using instead of 5? Display and interpret either the silhouette scores or the elbow method. - (5 points)
-> your answer here
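For j), both criteria can be computed in one loop: `inertia_` gives the within-cluster sum of squares used by the elbow method, and `silhouette_score` gives the mean silhouette. The sketch below runs on synthetic blobs with three well-separated groups, not on the Airbnb data, so the result only demonstrates the mechanics; `score_cluster_counts` is a hypothetical helper name.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def score_cluster_counts(X, k_values, seed=0):
    """Return {k: (inertia, silhouette)} for each candidate cluster count."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results

# Synthetic data with three well-separated groups (not the Airbnb data)
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5, random_state=0,
)
scores = score_cluster_counts(X, range(2, 7))
best_k = max(scores, key=lambda k: scores[k][1])
for k, (inertia, sil) in scores.items():
    print(k, round(inertia, 1), round(sil, 3))
print("best k by silhouette:", best_k)
```

On the listings, the elbow is the value of k after which inertia stops dropping sharply, and the silhouette favors the k with the highest mean score; the two need not agree.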
k) Would you recommend normalizing the price and number of reviews? Briefly explain why. - (2.5 points)
-> your answer here
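For i) and k), it can help to look at how much each feature contributes to the squared Euclidean distance that KMeans minimizes. The sketch below compares two hypothetical listings that are far apart on the map but similar in price and review count; the values and the helper name `squared_diffs` are made up for illustration.

```python
import math

# Two hypothetical listings: far apart on the map, similar price and reviews
a = {"longitude": -74.20, "latitude": 40.52, "price": 100, "number_of_reviews": 10}
b = {"longitude": -73.85, "latitude": 40.90, "price": 110, "number_of_reviews": 12}

def squared_diffs(p, q):
    """Per-feature squared differences, the terms KMeans sums up."""
    return {name: (p[name] - q[name]) ** 2 for name in p}

terms = squared_diffs(a, b)
total = sum(terms.values())
for name, value in terms.items():
    print(f"{name}: {value / total:.4%} of the squared distance")
print("distance:", math.sqrt(total))
```

Here longitude and latitude together account for well under 1% of the squared distance, even though the two points sit at opposite ends of the city. Putting price and number of reviews on a comparable scale, for example with the z-score from b), lets location influence the clusters as well.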
l) For all listings of type Shared room, plot the dendrogram of the hierarchical clustering generated from longitude, latitude, and price. - (5 points)
m) Briefly comment on what you observe from the structure of the dendrogram. - (2.5 points)
-> your answer here
n) Normalize the price as in b) and repeat l) - (2.5 points)
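A possible starting point for l) and n), using SciPy's `linkage` and `dendrogram`. The assignment does not name a linkage method, so Ward linkage here is an assumption, as are the helper names and the synthetic rows standing in for the Shared room listings. Plotting is kept in a separate function that imports matplotlib lazily.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

def build_linkage(points, normalize_price=False):
    """Ward linkage on rows of [longitude, latitude, price]."""
    data = np.asarray(points, dtype=float).copy()
    if normalize_price:
        price = data[:, 2]
        data[:, 2] = (price - price.mean()) / price.std(ddof=1)  # z-score as in b)
    return linkage(data, method="ward")

def plot_dendrogram(Z):
    """Draw the tree; matplotlib is imported lazily so the rest runs without it."""
    import matplotlib.pyplot as plt
    dendrogram(Z)
    plt.xlabel("listing")
    plt.ylabel("merge distance")
    plt.show()

# Synthetic stand-in for the Shared room rows of listings.csv
rng = np.random.default_rng(0)
points = np.column_stack([
    rng.uniform(-74.25, -73.70, 40),   # longitude
    rng.uniform(40.50, 40.91, 40),     # latitude
    rng.uniform(20, 500, 40),          # price
])
Z_raw = build_linkage(points)
Z_norm = build_linkage(points, normalize_price=True)
print(Z_raw.shape, Z_norm.shape)
```

Each row of the linkage matrix records one merge, and the third column is the merge distance. With raw prices those distances are driven almost entirely by price; after normalization they shrink and location starts to matter.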
This exercise will be using the mnist dataset.
a) Using Kmeans, cluster the images using 10 clusters and plot the centroid of each cluster. - (10 points)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
mnist = load_digits()
# your code here
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [5], in <module>
      1 import pandas as pd
----> 2 import matplotlib.pyplot as plt
      4 from sklearn.cluster import KMeans
      5 from sklearn.datasets import load_digits

ModuleNotFoundError: No module named 'matplotlib'
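A minimal sketch for a), assuming the `load_digits` data loaded above (scikit-learn's bundled set of 8x8 digit images, which is smaller than the original MNIST). Each centroid is a 64-value vector, so reshaping it to 8x8 turns it into an image. The plotting helper `plot_centroids` is hypothetical and imports matplotlib lazily, which sidesteps the import error shown above until the plot is actually requested.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()  # 1797 images, each flattened to 64 pixel values
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# Each centroid is a 64-vector; reshaping to 8x8 makes it viewable as an image
centroid_images = model.cluster_centers_.reshape(10, 8, 8)
print(centroid_images.shape)

def plot_centroids(images):
    """Show the ten centroids in a 2x5 grid (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 5, figsize=(8, 3))
    for ax, image in zip(axes.ravel(), images):
        ax.imshow(image, cmap="gray")
        ax.axis("off")
    plt.show()
```

Calling `plot_centroids(centroid_images)` draws the grid once matplotlib is installed. The cluster indices are arbitrary, so centroid 0 need not look like the digit 0.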
b) What is the disagreement distance between the clustering you created above and the clustering created by the labels attached to each image? Briefly explain what this number means in this context. - (10 points)
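Assuming the usual pairwise definition, the disagreement distance counts the pairs of points that one clustering places in the same cluster and the other places in different clusters. The sketch below implements that definition directly; the function name is illustrative, and the quadratic loop is fine for a couple of thousand points but slow beyond that.

```python
from itertools import combinations

def disagreement_distance(labels_p, labels_q):
    """Count pairs that one clustering puts together and the other splits."""
    if len(labels_p) != len(labels_q):
        raise ValueError("both clusterings must label the same points")
    count = 0
    for i, j in combinations(range(len(labels_p)), 2):
        same_p = labels_p[i] == labels_p[j]
        same_q = labels_q[i] == labels_q[j]
        if same_p != same_q:
            count += 1
    return count

print(disagreement_distance([0, 0, 1, 1], [1, 1, 0, 0]))  # relabeling only
print(disagreement_distance([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call returns 0 because renaming clusters changes no pair relationship, which is why the measure suits KMeans output, where cluster numbers do not correspond to digit labels. The second call returns 4 out of 6 possible pairs.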
c) Download the CIFAR-10 dataset here. Open batch_1 by following the documentation on the web page. Plot a random image from the dataset. - (10 points)
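A sketch for c), based on the layout described on the CIFAR-10 page: each batch is a pickled dictionary, and each image row holds 3072 values, with the 1024 red values first, then green, then blue. The helper names are assumptions, and the example uses a synthetic row so it runs without the download.

```python
import pickle
import numpy as np

def load_batch(path):
    """Unpickle one CIFAR-10 batch file; keys come back as bytes."""
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding="bytes")

def row_to_image(row):
    """Turn a 3072-value row (1024 R, 1024 G, 1024 B) into a 32x32x3 array."""
    return np.asarray(row, dtype=np.uint8).reshape(3, 32, 32).transpose(1, 2, 0)

# Synthetic row: red channel fully on, green and blue off
fake_row = np.concatenate([np.full(1024, 255), np.zeros(2048)]).astype(np.uint8)
image = row_to_image(fake_row)
print(image.shape, image[0, 0])
```

With the real file, something like `batch = load_batch("data_batch_1")` followed by `data = batch[b"data"]` should give the image rows (the path depends on where the archive was extracted). A random image is then `row_to_image(data[np.random.default_rng().integers(len(data))])`, which `matplotlib.pyplot.imshow` can display.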
d) This image is 32 x 32 pixels and each pixel is a 3-dimensional object of RGB (Red, Green, Blue) intensities. Using the same image as in c), produce an image that only uses 4 colors (the 4 centroids of the clusters obtained by clustering the image itself using Kmeans). - (10 points)
e) Write a function that applies this transformation to the entire dataset for any number K of colors. - (10 points)
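Parts d) and e) can share one helper: cluster the pixels of an image with KMeans, then replace every pixel with the centroid of its cluster. The sketch below is one way to do it under stated assumptions; the function names are hypothetical, and random images stand in for CIFAR-10. Centroids are rounded back to `uint8` so the result can be displayed like the original.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_image(image, k, seed=0):
    """Replace every pixel by the centroid of its KMeans cluster."""
    pixels = image.reshape(-1, 3).astype(float)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    recolored = model.cluster_centers_[model.labels_]
    return recolored.round().astype(np.uint8).reshape(image.shape)

def quantize_dataset(images, k, seed=0):
    """Apply quantize_image to each image in an (N, H, W, 3) array."""
    return np.stack([quantize_image(image, k, seed) for image in images])

# Synthetic stand-in for CIFAR-10: three random 32x32 RGB images
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 32, 32, 3), dtype=np.uint8)
reduced = quantize_dataset(images, k=4)
print(reduced.shape)
```

Each image is clustered on its own, so every image gets its own palette of at most `k` colors. Running this over all 10000 images of a batch fits one KMeans model per image and can take a while.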